Improving Statistical Machine Translation Accuracy Using Bilingual Lexicon Extraction with Paraphrases

Authors: Chu Chenhui, Kurohashi Sadao, Nakazawa Toshiaki
Publication venue: Department of Linguistics, Faculty of Arts, Chulalongkorn University
Publication date: 01/01/2014

Statistical machine translation (SMT) suffers from an accuracy problem: the translation pairs and their feature scores in the translation model can be inaccurate. The problem stems from the quality of the unsupervised methods used for translation model learning. Previous studies propose estimating comparable features for the translation pairs in the translation model from comparable corpora, to improve the accuracy of the translation model. Comparable feature estimation is based on bilingual lexicon extraction (BLE) technology. However, BLE suffers from the data sparseness problem, which makes the comparable features inaccurate. In this paper, we propose using paraphrases to address this problem. Paraphrases are used to smooth the vectors used in comparable feature estimation with BLE. In this way, we improve the quality of comparable features, which can improve the accuracy of the translation model and thus improve SMT performance. Experiments conducted on Chinese-English phrase-based SMT (PBSMT) verify the effectiveness of our proposed method.
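The vector smoothing idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the smoothing formula, so everything below is an assumption made for illustration: the function names `smooth_vector` and `cosine`, the interpolation weight `weight`, and the choice to mix a phrase's sparse context vector with the score-weighted average of its paraphrases' vectors. It is not the authors' method.

```python
from collections import Counter
from math import sqrt


def smooth_vector(phrase, vectors, paraphrases, weight=0.5):
    """Interpolate a phrase's context vector with those of its paraphrases.

    vectors:     dict mapping phrase -> {context word: count}
    paraphrases: dict mapping phrase -> list of (paraphrase, score)
    weight:      share of the result taken from paraphrases (assumed value)
    """
    base = vectors.get(phrase, {})
    # Keep only paraphrases that actually have a context vector.
    candidates = [(p, s) for p, s in paraphrases.get(phrase, []) if p in vectors]
    total = sum(s for _, s in candidates)
    if total == 0:
        return Counter(base)  # nothing to smooth with
    smoothed = Counter({k: (1.0 - weight) * v for k, v in base.items()})
    for p, s in candidates:
        for k, v in vectors[p].items():
            smoothed[k] += weight * (s / total) * v
    return smoothed


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy data: "purchase" is rare, so its vector misses the "price" context.
vectors = {"purchase": {"store": 1.0}, "buy": {"store": 2.0, "price": 2.0}}
paraphrases = {"purchase": [("buy", 0.8)]}
smoothed = smooth_vector("purchase", vectors, paraphrases)
```

In this toy data the unsmoothed vector for "purchase" has no overlap with a vector containing only "price", so their cosine similarity is zero; after smoothing, the "price" dimension borrowed from "buy" makes the similarity positive, which is the sparseness effect the abstract describes.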

CiteSeerX

Waseda University Repository

Institutional Repositories DataBase (IRDB)

Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: A Case Study on Chinese-Japanese Wikipedia

Authors: Chu Chenhui, Kurohashi Sadao, Nakazawa Toshiaki
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/02/2016

Parallel corpora are crucial for statistical machine translation (SMT); however, they are quite scarce for most language pairs and domains. As comparable corpora are far more available, many studies have been conducted to extract either parallel sentences or fragments from them for SMT. In this article, we propose an integrated system to extract both parallel sentences and fragments from comparable corpora. We first apply parallel sentence extraction to identify parallel sentences from comparable sentences. We then extract parallel fragments from the comparable sentences. Parallel sentence extraction is based on a parallel sentence candidate filter and a classifier for parallel sentence identification. We improve it by proposing a novel filtering strategy and three novel feature sets for classification. Previous studies have found it difficult to accurately extract parallel fragments from comparable sentences. We propose an accurate parallel fragment extraction method that uses an alignment model to locate the parallel fragment candidates and an accurate lexicon-based filter to identify the truly parallel fragments. A case study on the Chinese-Japanese Wikipedia indicates that our proposed methods outperform previously proposed methods, and that the parallel data extracted by our system significantly improves SMT performance.

Kyoto University Research Information Repository

Pivot-Based Topic Models for Low-Resource Lexicon Extraction

Authors: Kurohashi Sadao, Nakazawa Toshiaki, Richardson John
Publication date: 01/01/2015

Waseda University Repository

Designing the Business Conversation Corpus

Authors: Li Tong, Nakazawa Toshiaki, Ri Ryokan, Rikters Matīss
Publication date: 01/01/2019

While machine translation of written text has progressed considerably in the past several years thanks to the increasing availability of parallel corpora and corpus-based training technologies, automatic translation of spoken text and dialogues remains challenging even for modern systems. In this paper, we aim to boost the machine translation quality of conversational texts by introducing a newly constructed Japanese-English business conversation parallel corpus. A detailed analysis of the corpus is provided along with challenging examples for automatic translation. We also experiment with adding the corpus in a machine translation training scenario and show how the resulting system benefits from its use.

arXiv.org e-Print Archive

Crossref

Statistical phrase alignment model using dependency relation probability

Authors: Sadao Kurohashi, Toshiaki Nakazawa
Publication date: 01/01/2009

When aligning very different language pairs, the most important needs are the use of structural information and the capability of generating one-to-many or many-to-many correspondences. In this paper, we propose a novel phrase alignment method which models word or phrase dependency relations in dependency tree structures of source and target languages. The dependency relation model is a kind of tree-based reordering model, and can handle non-local reorderings which sequential word-based models often cannot handle properly. The model is also capable of estimating phrase correspondences automatically without any heuristic rules. Experimental alignment results show that our model achieves an F-measure 1.7 points higher than the conventional word alignment model with symmetrization algorithms.

CiteSeerX

Large-scale Dictionary Construction via Pivot-based Statistical Machine Translation with Significance Pruning and Neural Network Features

Authors: Chu Chenhui, Cromieres Fabien, Dabre Raj, Kurohashi Sadao, Nakazawa Toshiaki
Publication date: 01/01/2015

We present our ongoing work on large-scale Japanese-Chinese bilingual dictionary construction via pivot-based statistical machine translation. We utilize statistical significance pruning to control noisy translation pairs that are induced by pivoting. We construct a large dictionary which we manually verify to be of a high quality. We then use this dictionary and a parallel corpus to learn bilingual neural network language models to obtain features for reranking the n-best list, which leads to an absolute improvement of 5% in accuracy when compared to a setting that does not use significance pruning and reranking.
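As a rough illustration of how pivoting induces translation pairs, and why noise control is needed, the sketch below combines two toy translation tables through a shared pivot language by marginalizing over pivot entries. The function name `pivot_table`, the table layout, and the `min_prob` cutoff are assumptions for illustration only. In particular, the plain probability threshold is a simplified stand-in and is not the statistical significance pruning the abstract refers to.

```python
from collections import defaultdict


def pivot_table(src_to_pivot, pivot_to_tgt, min_prob=0.0):
    """Induce source-to-target translation probabilities through a pivot.

    Computes p(t | s) = sum over pivot entries e of p(e | s) * p(t | e),
    then drops pairs whose probability falls below min_prob
    (an assumed stand-in for a real pruning criterion).
    """
    result = {}
    for src, pivots in src_to_pivot.items():
        scores = defaultdict(float)
        for piv, p_piv in pivots.items():
            # Pivot entries with no target-side translations contribute nothing.
            for tgt, p_tgt in pivot_to_tgt.get(piv, {}).items():
                scores[tgt] += p_piv * p_tgt
        result[src] = {t: p for t, p in scores.items() if p >= min_prob}
    return result


# Toy Japanese -> English and English -> Chinese tables (made-up numbers).
ja_en = {"犬": {"dog": 0.9, "hound": 0.1}}
en_zh = {"dog": {"狗": 0.8, "犬": 0.2}, "hound": {"猎犬": 1.0}}

unpruned = pivot_table(ja_en, en_zh)
pruned = pivot_table(ja_en, en_zh, min_prob=0.15)
```

With these made-up numbers the unpruned table keeps three candidates for the Japanese entry, including a low-probability one reached only through the rarer pivot word; the threshold removes it while keeping the two stronger pairs.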

CiteSeerX

Waseda University Repository